Red Wine Quality by Fahad Alhajjaj

Citation Request:

This dataset is publicly available for research.
The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

In this report, a dataset of 1599 red wine instances, each described by
12 variables, is explored. The 12 variables are:
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)
12. quality (score between 0 and 10)

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.class       : chr  "F" "F" "F" "E" ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality      quality.class     
##  Min.   :3.000   Length:1599       
##  1st Qu.:5.000   Class :character  
##  Median :6.000   Mode  :character  
##  Mean   :5.636                     
##  3rd Qu.:6.000                     
##  Max.   :8.000

The dataset has 1599 observations with 12 descriptive variables; the 13th
column, quality.class, is derived from quality. The statistical summary of all
variables shown above will be used as a reference and to understand the
variables better.

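We see in the above histograms:
Fixed Acidity has an approximately normal distribution with a median of 7.90
and a mean of 8.32; the mean sitting above the median suggests a slight right skew.
Volatile Acidity has an approximately normal distribution with a median of
0.5200 and a mean of 0.5278.
The x-axis ranges differ between the two histograms because the two acids occur
in different quantities.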

We see in the above histogram that Citric Acid has two tall bars, at 0 and at 0.48.
Citric Acid has a median of 0.260 and a mean of 0.271.

We see in the above histogram that Residual Sugar has a long tail due to outliers.
Residual Sugar has a median of 2.200 and a mean of 2.539.

In the above Chlorides histogram we zoomed in to better understand the graph.
The Chlorides histogram has a long tail due to outliers.
Chlorides has a median of 0.07900 and a mean of 0.08747.

In the above histograms we started with the Free Sulfur Dioxide histogram,
then plotted the Total Sulfur Dioxide histogram, and then combined them,
since Free Sulfur Dioxide is part of Total Sulfur Dioxide.
We noticed that Free Sulfur Dioxide is concentrated at low levels.
Free Sulfur Dioxide has a median of 14.00 and a mean of 15.87.
Total Sulfur Dioxide has a median of 38.00 and a mean of 46.47.

In the above Density Histogram, Density has a normal distribution
with median 0.9968 and mean 0.9967.

In the above pH Histogram, pH has a normal distribution
with median 3.310 and mean 3.311.

We see in the above histogram that Sulphates has a long tail due to outliers.
Sulphates has a median of 0.6200 and a mean of 0.6581.

We see in the above histogram that Alcohol has a positively skewed distribution.
Alcohol has a median of 10.20 and a mean of 10.42.

We see in the above histogram that Quality, which takes only integer scores,
has a roughly normal distribution.
Quality has a median of 6.000 and a mean of 5.636.
Most wines score a 5 or 6 in quality.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wine instances with 12 features:
(fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides,
free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol,
quality). The only ordered variable is quality, together with quality.class,
which is derived from it.

(worst score) ----------> (best score)
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
K, J, I, H, G, F, E, D, C, B, A

Other observations:
* All wine instances score between 3 and 8.
* Both Residual Sugar and Chlorides have long tails.
* Minimum Alcohol % is 8.40 and maximum Alcohol % is 14.90.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the dataset are fixed acidity and quality.
I’d like to see how the other variables affect these two features.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

All other variables would help. I expect density and pH to be related to
fixed acidity, and fixed acidity, volatile acidity, and alcohol to have the
most effect on quality.

Did you create any new variables from existing variables in the dataset?

Yes. I added quality.class, a letter grade derived from quality (described in
the next answer).

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Residual Sugar and Chlorides showed long tails when I graphed them.
I zoomed in when plotting the Chlorides histogram to better understand the
graph, because it had a long tail.

I applied a transformation to Residual Sugar because it is heavily right skewed.
I took my reviewer’s advice on this matter.

I added a new variable called ‘quality.class’ by converting the numeric quality
scale (0 to 10) to letter grades (K to A, with A the best) to better
understand and visualize the quality.
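The score-to-letter conversion can be sketched as follows. This is an illustrative Python sketch, not the code that produced this report (the output shown here comes from R); the names `GRADES` and `quality_class` are hypothetical. It follows the scale table above, where 0 maps to K and 10 maps to A.

```python
# Letter grades ordered from worst (score 0) to best (score 10),
# following the scale table in this report: 0 -> K, ..., 10 -> A.
GRADES = "KJIHGFEDCBA"


def quality_class(score: int) -> str:
    """Return the letter grade for an integer quality score in 0..10."""
    if not 0 <= score <= 10:
        raise ValueError(f"quality score out of range: {score}")
    return GRADES[score]


# First ten quality scores listed in the str() output earlier in the report.
sample_scores = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
sample_classes = [quality_class(q) for q in sample_scores]
print(sample_classes)
```

A score of 5 maps to "F" and a score of 6 maps to "E", which agrees with the first quality.class values shown in the str() output.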

Bivariate Plots Section

This is an overview of the bivariate plots. It is used to choose the graphs
and to understand the relationships between variables.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$fixed.acidity and red_wine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

We see a positive correlation between Fixed Acidity and Citric Acid.
The Pearson’s product-moment correlation is 0.6717034.
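As a cross-check on the output above, the reported t statistic can be recovered from the correlation and the degrees of freedom using the standard relation t = r * sqrt(df / (1 - r^2)). The sketch below is illustrative Python, not the R code behind this report; the function name `pearson_t` is hypothetical.

```python
import math


def pearson_t(r: float, n: int) -> float:
    """t statistic for testing H0: correlation = 0, with n - 2 degrees of freedom."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r))


# Fixed Acidity vs. Citric Acid: r = 0.6717034 over n = 1599 wines (df = 1597).
t_citric = pearson_t(0.6717034, 1599)
print(f"t = {t_citric:.3f}")
```

With df = 1597, even a modest correlation produces a large t statistic, which is why every test in this section reports a small p-value.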

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$fixed.acidity and red_wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

We see a positive correlation between Fixed Acidity and Density.
The Pearson’s product-moment correlation is 0.6680473.
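The 95 percent confidence interval in the output above can be reproduced with the Fisher z-transformation: transform r with atanh, add and subtract a normal quantile times 1/sqrt(n - 3), and transform back with tanh. The sketch below is illustrative Python rather than the report's own R code; `pearson_ci` is a hypothetical name.

```python
import math
from statistics import NormalDist


def pearson_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Confidence interval for a Pearson correlation via the Fisher z-transformation."""
    z = math.atanh(r)                             # Fisher z of the sample correlation
    se = 1.0 / math.sqrt(n - 3)                   # approximate standard error of z
    q = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for a 95% interval
    return math.tanh(z - q * se), math.tanh(z + q * se)


# Fixed Acidity vs. Density: r = 0.6680473 over n = 1599 wines.
low, high = pearson_ci(0.6680473, 1599)
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

The interval is narrow because the sample is large, so the estimate of about 0.67 is fairly precise.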

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$fixed.acidity and red_wine$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

We see a negative correlation between Fixed Acidity and pH.
The Pearson’s product-moment correlation is -0.6829782.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$fixed.acidity and red_wine$alcohol
## t = -2.4691, df = 1597, p-value = 0.01365
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.11035580 -0.01268548
## sample estimates:
##         cor 
## -0.06166827

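We see almost no correlation between Fixed Acidity and Alcohol.
The Pearson’s product-moment correlation is -0.06166827. With 1599 observations
this is statistically significant (p-value = 0.01365), but the effect is negligible.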

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$volatile.acidity and red_wine$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

On the other hand, we see a negative correlation between Volatile Acidity and
Citric Acid. The Pearson’s product-moment correlation is -0.5524957.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

We see a positive correlation between Quality and Alcohol.
The Pearson’s product-moment correlation is 0.4761663.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

We see a moderate negative correlation between Quality and Volatile Acidity.
The Pearson’s product-moment correlation is -0.3905578.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

We see almost no correlation between Quality and pH.
The Pearson’s product-moment correlation is -0.05773139.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

We see only a weak positive correlation between Quality and Fixed Acidity.
The Pearson’s product-moment correlation is 0.1240516.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

We saw positive, negative, and near-zero correlations between the features of
interest and other features.

We saw positive correlations between:
Fixed Acidity and Citric Acid
Fixed Acidity and Density
Quality and Alcohol

We saw negative correlations between:
Fixed Acidity and pH
Volatile Acidity and Citric Acid
Quality and Volatile Acidity

We saw almost no correlation between:
Fixed Acidity and Alcohol
Quality and pH

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes, we see a negative correlation between Volatile Acidity and Citric Acid.
The Pearson’s product-moment correlation is -0.5524957.
That was interesting.

What was the strongest relationship you found?

There were three strong relationships in the dataset:
Fixed Acidity and Citric Acid (positive relation - Pearson’s correlation = 0.67)
Fixed Acidity and Density (positive relation - Pearson’s correlation = 0.67)
Fixed Acidity and pH (negative relation - Pearson’s correlation = -0.68)

Multivariate Plots Section

We see some positive correlation, with points concentrated at low values of x
and y. We also see that wine quality improves as the x and y values increase.

We see some negative correlation between Fixed Acidity and pH.
We also see that quality is scattered all over the graph.

We see a positive correlation between Fixed Acidity and Density.
The higher quality wines mostly sit lower on the graph than the lower quality wines.

We see some positive correlation between the ratio Fixed Acidity/Volatile
Acidity and Citric Acid. We can also see some overlapping at the lower x and y
values, and wine quality improves as the x and y values increase.

The lower the quality, the smaller the quantile boxes.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

There was almost no relation between Fixed Acidity and Alcohol. However, when
we plotted the ratio of Fixed Acidity over Volatile Acidity against Alcohol,
we saw a positive correlation.

Were there any interesting or surprising interactions between features?

Yes, the ratio of Fixed Acidity over Volatile Acidity gave some interesting
results when plotted against other features.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

## [1] "Fixed Acidity (g/dm^3) Statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## [1] "Citric Acid (g/dm^3) Statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## [1] "Quality Statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$fixed.acidity and red_wine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

Description One

We see a positive correlation between Fixed Acidity and Citric Acid.
The Pearson’s product-moment correlation is 0.6717034.
Most of the higher quality wines (C-D) are above the regression line while
the lower quality wines (G-H) are below it. The mid-quality wines (E-F),
however, fall both above and below the regression line. The points are
scattered evenly through the graph. I chose this graph because I wanted to see
how quality is distributed in it, and I was not surprised by the result.
I dropped the top 1% of the Fixed Acidity data because there were some gaps in
the values.

Plot Two

## [1] "Fixed Acidity (g/dm^3) Statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## [1] "Density (g/cm^3) Statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
## [1] "Quality Statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$fixed.acidity and red_wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

Description Two

We see a positive correlation between Fixed Acidity and Density.
The Pearson’s product-moment correlation is 0.6680473.
Most of the higher quality wines (C-D) are below the regression line while
the lower quality wines (G-H) lie along it. The mid-quality wines (E-F),
however, fall both above and below the regression line, with (F) mostly above it.
The points are scattered evenly through the graph. I chose this graph because
I wanted to see how quality is distributed in it, and I was not surprised by
the result. I dropped the top 1% of the Fixed Acidity data because there were
some gaps in the values.

Plot Three

## [1] "Fixed Acidity (g/dm^3) Statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## [1] "Alcohol (% by volume) Statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## [1] "Quality Statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$fixed.acidity and red_wine$alcohol
## t = -2.4691, df = 1597, p-value = 0.01365
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.11035580 -0.01268548
## sample estimates:
##         cor 
## -0.06166827

Description Three

We see almost no correlation between Fixed Acidity and Alcohol.
The Pearson’s product-moment correlation is -0.06166827.
Most of the higher quality wines (C-D) are above the regression line while
the lower quality wines (F-H) are below it. Wines of quality (E), on the
other hand, are scattered everywhere. The points are scattered evenly through
the graph. I chose this graph because I was fascinated by how alcohol affects
quality, and I was surprised by how quality was distributed in the graph.
I limited the x-axis by dropping the top 1% of the values.


Reflection

In this project I worked on the Red Wine Quality dataset.
The dataset has 1599 observations and 12 variables. I started by adding a
new variable for quality class (A - K) converted from the quality measure (0-10).
Then I examined each variable to better understand the dataset.

There was some correlation between variables, some positive and some
negative. Some relations were obvious, such as Fixed Acidity and pH, and some
were surprising to me, such as Fixed Acidity and Alcohol.

One limitation of the dataset is the number of observations.
More observations would allow a better understanding and exploration of the
dataset. To explore the dataset further, I would try to find the relationship
between all features and the quality feature in order to predict the quality
of a specific wine.